Abstract

An algorithm for optimization of signal significance or any other classification figure of merit 
suited for analysis of high energy physics (HEP) data is described. This algorithm trains decision 
trees on many bootstrap replicas of training data with each tree required to optimize the signal 
significance or any other chosen figure of merit. New data are then classified by a simple majority 
vote of the built trees. The performance of this algorithm has been studied using a search for the 
radiative leptonic decay $B \to \gamma\ell\nu$ at BaBar and shown to be superior to that of all other attempted
classifiers including such powerful methods as boosted decision trees. In the $B \to \gamma e\nu$ channel, the
described algorithm increases the expected signal significance from $2.4\sigma$ obtained by an original
method designed for the $B \to \gamma\ell\nu$ analysis to $3.0\sigma$.
1. INTRODUCTION



Separation of signal and background is perhaps the most important problem in analy- 
sis of HEP data. Various pattern classification tools have been employed in HEP practice 
to solve this problem. The Fisher discriminant and feedforward backpropagation neural networks
are the two most popular methods chosen by HEP analysts at present. Alternative
algorithms for classification such as decision trees [3], bump hunting, and AdaBoost
have been recently explored by the HEP community as well. These classifiers can
be characterized by such features as predictive power, interpretability, stability and ease 
of training, CPU time required for training and classifying new events, and others. It is 
important to remember that the choice of a classifier for each problem should be driven by 
specifics of the analysis. For example, if the major goal of pattern classification is to achieve 
a high quality of signal and background separation, flexible classifiers such as AdaBoost and 
neural nets should be the prime choice. While neural nets generally perform quite well in 
low dimensions, they become too slow and unstable in high-dimensional problems, losing the
competition to AdaBoost. If the analyst, however, is mostly concerned with a clear inter- 
pretation of the classifier output, decision trees and bump hunting algorithms are a more 
appealing option. These classifiers produce rectangular regions, easy to visualize in many 
dimensions. 

One of the problems faced by HEP analysts is the indirect nature of available classifiers. 
In HEP analysis, one typically wants to optimize a figure of merit expressed as a function 
of signal and background, $S$ and $B$, expected in the signal region. An example of such a
figure of merit is the signal significance, $S/\sqrt{S+B}$, often used by physicists to express the
cleanliness of the signal in the presence of statistical fluctuations of observed signal and 
background. None of the available popular classifiers optimizes this figure of merit directly. 
CART, a popular commercial implementation of decision trees, splits training data into
signal- and background-dominated rectangular regions using the Gini index, $Q = 2p(1-p)$,
as the optimization criterion, where $p$ is the correctly classified fraction of events in a tree
node. Neural networks typically minimize a quadratic classification error,
$\mathcal{E}_{\mathrm{quad}} = \sum_{n=1}^{N} (y_n - f(x_n))^2$, where $y_n$ is the true class of an event, $-1$ for
background and $1$ for signal, $f(x_n)$ is the
continuous value of the class label in the range $[-1, 1]$ predicted by the neural network, and
the sum is taken over the $N$ events in the training data set. Similarly, AdaBoost minimizes an
exponential classification error, $\mathcal{E}_{\mathrm{exp}} = \sum_{n=1}^{N} \exp(-y_n f(x_n))$. These optimization criteria
are not necessarily optimal for maximization of the signal significance. The usual solution 
is to build a neural net or an AdaBoost-based classifier and then find an optimal cut on the 
continuous output of the classifier to maximize the signal significance. For decision trees, 
the solution is to construct a decision tree with many terminal nodes and then combine 
these nodes to maximize the signal significance. 
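
The cut scan described above can be sketched in a few lines of Python. This is an illustration only, not code from any package: the function names `signal_significance`, `gini_index`, and `best_cut` are hypothetical, events are assumed to carry labels +1 (signal) and -1 (background), and the event weights are assumed to be already normalized to the expected yields.

```python
import math

def signal_significance(s, b):
    """S / sqrt(S + B); defined as 0 when the region is empty."""
    total = s + b
    return s / math.sqrt(total) if total > 0 else 0.0

def gini_index(p):
    """Gini index Q = 2p(1 - p) for a node with correctly classified fraction p."""
    return 2.0 * p * (1.0 - p)

def best_cut(outputs, labels, weights):
    """Scan lower cuts on a continuous classifier output and return the
    (cut, significance) pair that maximizes S / sqrt(S + B)."""
    best = (None, 0.0)
    for cut in sorted(set(outputs)):
        s = sum(w for o, y, w in zip(outputs, labels, weights) if o >= cut and y == 1)
        b = sum(w for o, y, w in zip(outputs, labels, weights) if o >= cut and y == -1)
        sig = signal_significance(s, b)
        if sig > best[1]:
            best = (cut, sig)
    return best
```

The scan is quadratic in the number of events, which is acceptable for a sketch; a production version would sort the events once and accumulate $S$ and $B$ in a single pass.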

This problem has been partially addressed in my C++ software package for pattern
classification. Default implementations of the decision tree and the bump hunting algorithm
include both standard figures of merit used for commercial decision trees such as the Gini
index and HEP-specific figures of merit such as the signal significance or the signal
purity, $S/(S+B)$. The analyst can optimize an arbitrary figure of merit by providing an
implementation of the corresponding abstract interface set up in the package.
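
To illustrate what a pluggable figure of merit might look like, here is a minimal Python sketch. It does not reproduce the interface of the C++ package: the class names `FigureOfMerit`, `SignalSignificance`, and `SignalPurity`, the method `evaluate`, and the helper `pick_region` are all assumptions made for this example.

```python
import math
from abc import ABC, abstractmethod

class FigureOfMerit(ABC):
    """Hypothetical interface: maps the signal and background weights
    (S, B) in a candidate region to a number to be maximized."""

    @abstractmethod
    def evaluate(self, s: float, b: float) -> float:
        ...

class SignalSignificance(FigureOfMerit):
    def evaluate(self, s, b):
        return s / math.sqrt(s + b) if s + b > 0 else 0.0

class SignalPurity(FigureOfMerit):
    def evaluate(self, s, b):
        return s / (s + b) if s + b > 0 else 0.0

def pick_region(regions, fom):
    """Return the (S, B) pair with the largest figure of merit."""
    return max(regions, key=lambda sb: fom.evaluate(*sb))
```

With this design the search routine (`pick_region` here, a tree-splitting routine in practice) never needs to know which criterion it is optimizing; swapping the figure of merit means passing a different object.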

AdaBoost and the neural net, however, cannot be modified that easily. The functional 
forms of the classification error are intimately tied to implementations of these two classi- 
fication algorithms. Finding a powerful method for optimization of HEP-specific figures of 
merit is therefore an open question.

This note describes an algorithm that can be used for direct optimization of an arbitrary 
figure of merit. Optimization of the signal significance by this algorithm has shown results 
comparable to or better than those obtained with AdaBoost or the neural net. The training
time used by this algorithm is comparable to that used by AdaBoost with decision trees; 
the algorithm is therefore faster than the neural net in high dimensions. The method has 
been coded in C++ and included in the StatPatternRecognition package available for free 
distribution to HEP analysts. 



2. BAGGING DECISION TREES 

The implementation of decision trees used for the proposed algorithm is described in detail 
in Ref. |^. The key feature of this implementation is its ability to optimize HEP-specific 
figures of merit such as the signal significance. 

A decision tree, even if it directly optimizes the desired figure of merit, is rarely powerful 
enough to achieve a good separation between signal and background. The tree produces 
a set of signal-dominated rectangular regions. Rectangular regions, however, often fail to 
capture a non-linear structure of data. The mediocre predictive power of a single decision 
tree can be greatly enhanced by one of the two popular methods for combining classifiers — 
boosting and bagging. 

Both these methods work by training many classifiers, e.g., decision trees, on variants of 
the original training data set. A boosting algorithm increases the weights of misclassified events,
reduces the weights of correctly classified events, and trains a new classifier on the reweighted
sample. The output of the new classifier is then used to re-evaluate fractions of correctly 
classified and misclassified events and update the event weights accordingly. After training 
is completed, events are classified by a weighted vote of the trained classifiers. AdaBoost, a 
popular version of this approach, has been shown to produce a high-quality robust training 
mechanism. Application of AdaBoost to HEP data has been explored in Refs. 
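
For concreteness, one reweighting round can be sketched as follows. The formulas are those of the standard discrete AdaBoost from the statistics literature, assumed here because this note does not spell them out; the function name `adaboost_reweight` is hypothetical.

```python
import math

def adaboost_reweight(weights, labels, predictions):
    """One round of discrete AdaBoost reweighting.
    labels and predictions take values +1 or -1; weights must sum to 1.
    Returns (alpha, new_weights), where alpha is the classifier's vote weight."""
    # Weighted fraction of misclassified events.
    eps = sum(w for w, y, h in zip(weights, labels, predictions) if y != h)
    if not 0.0 < eps < 0.5:
        raise ValueError("weighted error must lie strictly between 0 and 0.5")
    alpha = 0.5 * math.log((1.0 - eps) / eps)
    # Misclassified events (y * h = -1) are enhanced, correct ones suppressed.
    raw = [w * math.exp(-alpha * y * h)
           for w, y, h in zip(weights, labels, predictions)]
    norm = sum(raw)
    return alpha, [w / norm for w in raw]
```

After the update, the misclassified events carry half of the total weight, which forces the next classifier to concentrate on them.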

In contrast, bagging algorithms do not reweight events. Instead, they train new 
classifiers on bootstrap replicas of the training set. Each bootstrap replica is obtained
by sampling with replacement from the original training set, with the size of each replica 
equal to that of the original set. After training is completed, events are classified by the 
majority vote of the trained classifiers. For successful application of the bagging algorithm, 
the underlying classifier must be sensitive to small changes in the training data. Otherwise 
all trained classifiers will be similar, and the performance of the single classifier will not be 
improved. This condition is satisfied by a decision tree with fine terminal nodes. Because 
of the small node size each decision tree is significantly overtrained; if the tree were used 
just by itself, its predictive power on a test data set would be quite poor. However, because 
the final decision is made by the majority vote of all the trees, the algorithm delivers a high 
predictive power. 
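
The bagging procedure just described can be sketched in Python. This is a minimal illustration under stated assumptions: the toy base learner `train_stump` (a single cut on a one-dimensional variable, assuming signal lies at larger values) stands in for the decision tree, and all function names are hypothetical.

```python
import random

def bootstrap_replica(data, rng):
    """Sample len(data) events with replacement from the training set."""
    return [rng.choice(data) for _ in data]

def bag(train, data, n_classifiers, seed=0):
    """Train n_classifiers on independent bootstrap replicas.
    `train` maps a data set to a classifier; a classifier maps an
    event to +1 (signal) or -1 (background)."""
    rng = random.Random(seed)
    return [train(bootstrap_replica(data, rng)) for _ in range(n_classifiers)]

def majority_vote(classifiers, event):
    """Classify by the sign of the summed votes (ties go to background)."""
    votes = sum(clf(event) for clf in classifiers)
    return 1 if votes > 0 else -1

def train_stump(data):
    """Toy base learner (assumption): cut halfway between the class means of x."""
    sig = [x for x, y in data if y == 1]
    bkg = [x for x, y in data if y == -1]
    if not sig or not bkg:  # degenerate replica containing a single class
        label = 1 if sig else -1
        return lambda x: label
    cut = 0.5 * (sum(sig) / len(sig) + sum(bkg) / len(bkg))
    return lambda x: 1 if x > cut else -1
```

A stump is a deliberately poor choice for real use: it is stable under resampling, so bagging gains little from it. The sketch only shows the mechanics; the gain comes from unstable base classifiers such as trees with fine terminal nodes.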

Various kinds of boosting and bagging algorithms have been compared in the statistics 
literature. Neither of these two approaches has a clear advantage over the other. On average, 
boosting seems to provide a better predictive power. Bagging tends to perform better in the 
presence of outliers and significant noise.

For optimization of the signal significance, however, bagging is the choice favored by 
intuition. Reweighting events has an unclear impact on the effectiveness of the optimization 
routine with respect to the chosen figure of merit. While it may be possible to design
a reweighting algorithm efficient for optimization of a specific figure of merit, at present
such reweighting algorithms are not known. Bagging, on the other hand, offers an obvious 
solution. If the base classifier directly optimizes the chosen figure of merit, bagging is 
equivalent to optimization of this figure of merit integrated over bootstrap replicas. In 
effect, the bagging algorithm finds a region in the space of physical variables that optimizes 
the expected value of the chosen figure of merit — exactly what the analyst is looking for. 

Bagging decision trees is certainly not new in statistics research. The only
novelty introduced in this note is the decision tree designed for direct optimization of an 
arbitrary figure of merit, e.g., the signal significance. 
FIG. 1: Separation variables for the $B \to \gamma\ell\nu$ analysis. Signal MC is shown with a solid line
(triangles in the numLepton plot), and the overall combined background is shown with a dashed 
line (squares in the numLepton plot). 
TABLE I: Signal significance, $\mathcal{S}_{\mathrm{train}}$, $\mathcal{S}_{\mathrm{valid}}$, and $\mathcal{S}_{\mathrm{test}}$, for the $B \to \gamma\ell\nu$ training, validation, and
test samples obtained with various classification methods. The signal significance computed for the
test sample should be used to judge the predictive power of the included classifiers. A branching 
fraction of $3 \times 10^{-6}$ was assumed for both $B \to \gamma\mu\nu$ and $B \to \gamma e\nu$ decays. $W_1$ and $W_0$ represent the
signal and background, respectively, expected in the signal region after the classification criteria
have been applied; these two numbers have been estimated using the test samples. All numbers 
have been normalized to the integrated luminosity of 210 fb$^{-1}$. The best value of the expected
signal significance is shown in boldface. 
FIG. 2: Output of the bagging algorithm with 100 trained decision trees (left) and the signal
significance versus the lower cut on the output (right) for the $B \to \gamma e\nu$ test sample. The cut
maximizing the signal significance, obtained using the validation sample, is shown with a vertical 
line. 
Performance of the described bagging algorithm has been studied using a search for the
radiative leptonic decay $B \to \gamma\ell\nu$ at BaBar. Eleven variables used for classification in
this analysis are shown in Fig. 1. Several methods have been used to separate signal from
background by maximizing the signal significance: an original method developed by the 
analysts, the decision tree optimizing the signal significance, the bump hunting algorithm, 
AdaBoost with binary splits, AdaBoost with decision trees optimizing the Gini index, and 
an AdaBoost-based combiner of background subclassifiers. I also attempted to use a feedfor- 
ward backpropagation neural network with one hidden layer, but the network was unstable 
and it failed to converge to an optimum. A more detailed description of this analysis and 
used classifiers can be found in Ref. Q. 

To test the bagging algorithm described in this note, I trained 100 decision trees on 
bootstrap replicas of the training data. For classification of new data, the trained trees were 
combined using an algebraic sum of their outputs: if an event was accepted by a tree, the
output for this event was incremented by 1; otherwise it was decremented by 1. The minimal
size of the terminal node in each tree, 100 events for both the $B \to \gamma e\nu$ and $B \to \gamma\mu\nu$ channels,
was chosen by comparing values of the signal significance computed for the validation data.
The size of the trained decision trees varied from 390 to 470 terminal signal nodes in the
$B \to \gamma e\nu$ channel and from 300 to 370 in the $B \to \gamma\mu\nu$ channel. Jobs executing the
algorithm took several hours in a batch queue at SLAC. To assess the true performance of
the method, the signal significance was then evaluated for the test data. 
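
The combination rule and the validation/test protocol above can be sketched as follows. This is an illustrative sketch with hypothetical function names and unit event weights, not the code used for the analysis.

```python
import math

def ensemble_output(accepted_by_tree):
    """Algebraic sum over trees: +1 if a tree accepts the event, -1 otherwise."""
    return sum(1 if accepted else -1 for accepted in accepted_by_tree)

def significance_above(cut, outputs, labels):
    """S / sqrt(S + B) for events with ensemble output >= cut (unit weights)."""
    s = sum(1 for o, y in zip(outputs, labels) if o >= cut and y == 1)
    b = sum(1 for o, y in zip(outputs, labels) if o >= cut and y == -1)
    return s / math.sqrt(s + b) if s + b > 0 else 0.0

def choose_cut(valid_outputs, valid_labels):
    """Pick the lower cut that maximizes the significance on validation data."""
    return max(sorted(set(valid_outputs)),
               key=lambda c: significance_above(c, valid_outputs, valid_labels))
```

With 100 trees the output ranges from -100 to 100, and an output above zero corresponds to a simple majority vote. The cut is fixed with `choose_cut` on the validation sample and then passed unchanged to `significance_above` on the test sample, so the quoted significance is not biased by the choice of the cut.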

All attempted classifiers are compared in Table I. The output of the described bagging 
algorithm for the $B \to \gamma e\nu$ test data is shown in Fig. 2. The bagging algorithm provides the
best value of the signal significance. It gives a 24% improvement over the original method 
developed by the analysts and shown in the first line of Table I, and a 14% improvement 
over AdaBoost with decision trees shown in line 5 of Table I; both numbers are quoted for 
the $B \to \gamma e\nu$ channel.

I also used AdaBoost with decision trees optimizing the signal significance and the bag- 
ging algorithm with decision trees optimizing the Gini index. The first method performed 
quite poorly; the signal significance obtained with this method was much worse than that 
obtained by AdaBoost with decision trees optimizing the Gini index. The bagging algorithm 
with decision trees optimizing the Gini index showed an 8% improvement in the $B \to \gamma e\nu$
signal significance compared to AdaBoost with decision trees optimizing the Gini index. 
But the signal significance obtained with this method was 9% worse than that obtained by 
the bagging algorithm with decision trees optimizing the signal significance. The 14% im- 
provement of the proposed bagging algorithm over AdaBoost with decision trees originated 
therefore from two sources: 

• Using bagging instead of boosting. 

• Using the signal significance instead of the Gini index as the figure of merit for the 
decision tree optimization. 

In an attempt to improve the signal significance even further, I used the random forest 
approach, a more generic resampling method. In addition to generating a new bootstrap
replica for each tree, I resampled the data variables used to split each node of the tree. 
Because a bootstrap replica contains on average 63% of distinct entries from the original 
set, only 6.9 variables out of 11 were used on average to split the tree nodes. This approach 
showed only a minor 1% improvement in the $B \to \gamma e\nu$ signal significance over the bagging
algorithm without variable resampling. 
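
The 63% quoted above is the standard large-sample limit for sampling with replacement. As a sketch, with $N$ denoting the size of the set being resampled (a symbol introduced here for illustration), the probability that a given entry appears at least once in a replica of size $N$ is

```latex
1 - \left(1 - \frac{1}{N}\right)^{N}
  \;\longrightarrow\; 1 - e^{-1} \approx 0.63 \quad (N \to \infty),
\qquad
0.63 \times 11 \approx 6.9 .
```

For a set as small as 11 variables the finite-$N$ expression gives a slightly larger value, $11\,[1 - (10/11)^{11}] \approx 7.1$, so the quoted 6.9 appears to correspond to the large-$N$ approximation.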
As shown in Fig. 2, the described bagging algorithm does not provide a good separation
between signal and background in terms of the quadratic or exponential classification error. 
It misclassifies a large fraction of signal events. However, the method does the job it was 
expected to do — it finds a region in the space of physical variables that, on average, 
maximizes the signal significance. 

3. SUMMARY 

A bagging algorithm suitable for optimization of an arbitrary figure of merit has been 
described. This algorithm has been shown to give a significant improvement of the signal 
significance in the search for the radiative leptonic decay $B \to \gamma\ell\nu$ at BaBar. Included in
the StatPatternRecognition package, this method is available to HEP analysts.